Effectively utilizing publicly available databases for cancer target evaluation

Abstract The majority of compounds designed against cancer drug targets do not progress to become approved drugs, mainly due to lack of efficacy and/or unmanageable toxicity. Robust target evaluation is therefore required before progressing through the drug discovery process to reduce the high attrition rate. There is a wealth of publicly available databases that can be mined to generate data as part of a target evaluation. It can, however, be challenging to learn what databases are available, how and when they should be used, and to understand the associated limitations. Here, we have compiled and present key, freely accessible and easy-to-use databases that house informative datasets from in vitro, in vivo and clinical studies. We also highlight comprehensive target review databases that aim to bring together information from multiple sources into one-stop portals. In the post-genomics era, a key objective is to exploit the extensive cell, animal and patient characterization datasets in order to deliver precision medicine on a patient-specific basis. Effective utilization of the highlighted databases will go some way towards supporting the cancer research community in achieving these aims.


INTRODUCTION
Oncology drug discovery aims to develop therapeutic agents that modulate biological processes to inhibit progression of cancer in the clinical setting. Drug discovery is a resource-intensive process that can broadly be broken down into four steps: target identification and validation, hit identification and validation, lead identification and optimization, and clinical development (1). Each subsequent step is more resource intensive than the last, requiring significant financial investment. The average time for a new drug to go from the start of the process to approval is 10-15 years with costs exceeding $1 billion (2,3). Despite the significant resource involved, over 90% of new oncology agents do not become approved drugs, mainly due to lack of efficacy and/or unmanageable toxicity (4). It is therefore essential that sufficient efforts are invested in the evaluation of a novel target before a project progresses into drug discovery. The main aims of target evaluation (Figure 1) are to establish whether available information suggests that an agent modulating target activity is likely to be efficacious and tolerable in the clinical setting, identify the patient population to target (clinical positioning) and determine whether the project is technically feasible (tractability) (5).
Figure 1. Target evaluation summary. Schematic diagram illustrating tractability, tolerability, efficacy and clinical positioning information that is required as part of a target evaluation assessment to enable decisions to be made regarding progression of a project into the full drug discovery process. Information can be obtained from a variety of sources, including online databases as highlighted.
As efforts to identify new therapeutic interventions in oncology have moved from broad-spectrum cytotoxic agents to selective agents against specific targets dysregulated in disease or agents that modify the tumour immune response (6), our need to understand target biology in different settings has drastically increased. The complex nature of cancer progression and the interplay between tumour intrinsic effects and interactions with the tumour microenvironment make it extremely challenging to generate all the information required for robust evaluation of a target. For novel targets, it is often the case that there is not a sufficient depth of literature available. However, in the post-genomics era, the extensive availability of omics datasets from cell lines, animal models and patient samples means that detailed information is available to support comprehensive target evaluation. As detailed in Figure 1, there are several sources of information that can inform tractability, tolerability, efficacy and clinical positioning assessments. One valuable source that we believe is currently under-utilized is the wealth of publicly available databases that can be mined to generate these data. Database mining can help the cancer research community achieve two key objectives. The first is to ensure that targets are robustly evaluated prior to entering the drug discovery process, with the aim of reducing the high attrition rate at later stages. The second is to support the progression of targets that offer the opportunity to develop precision medicines on a patient-specific basis.
The wealth of available databases can, however, present a significant challenge. How do we gain an awareness of the databases, learn what information they can provide and understand their limitations? To help address these questions, we highlight several key freely available databases and discuss how they can be utilized as part of a target evaluation. To illustrate their utility, we present key examples of data outputs that can be obtained. We also discuss the limitations of the available databases and suggest additional information that would be of use to the cancer research and drug discovery communities. Finally, we show how the database outputs can be combined to evaluate a novel target and how they can be used to predict challenges associated with clinical development.

Cancer cell line
The primary mechanism of action of the majority of approved anti-cancer drugs is to directly inhibit proliferation or to induce tumour cell death (7). Demonstrating cancer cell line target dependency for proliferation and/or viability therefore provides key validation for targets whose function in disease is primarily mediated through cancer cell intrinsic effects. In addition to predicting efficacy of target inhibition, output from cancer cell line profiling databases can be used to guide clinical positioning and support the identification of suitable clinically relevant cellular models in which to test compound efficacy.
The Cancer Dependency Map (DepMap) portal (https://depmap.org/portal/) is an initiative from the Broad and Sanger Institutes that houses large-scale RNAi and CRISPR screening data from several sources with the aim of identifying intrinsic vulnerabilities of cancer cell lines. The DepMap portal also houses small molecule inhibitor screening data to identify pharmacological sensitivities across large cancer cell line panels, which we cover in the 'Identification and profiling of tool compounds' section.
Within DepMap, data from large-scale CRISPR (8,9) or RNAi (10-12) screens in cancer cell lines have been combined. The Chronos (13) or DEMETER2 (14) algorithms, respectively, are used to normalize datasets to produce a single integrated output. Output is presented in the form of a 'gene effect' score where anything below 0 represents a loss of viability, with −1 being the median score for common essential genes. Data integration allows over 1000 cancer cell lines for CRISPR and over 700 cell lines for RNAi to be compared. Combining screening data from different primary screens that have been performed using distinct methodologies could result in variability across datasets, diminishing the statistical power to identify cell line-specific vulnerabilities. However, a comparative analysis between two of the largest contributors to the CRISPR screening dataset demonstrated that despite the differences in experimental set-up, there was a high degree of concordance between the studies in terms of both cell line-specific dependencies and predictive biomarkers identified (15), which increases confidence in the use of DepMap CRISPR screening datasets to support cancer target evaluation.
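As an illustration of how the gene effect scale can be read, the following minimal Python sketch labels cell lines for a single gene. The cell line names and scores are invented for illustration and are not DepMap data, and the use of −1 as a cut-off for strong dependency is an assumption made here rather than a rule imposed by the portal.

```python
from statistics import median

# Hypothetical gene effect scores for one gene across five cell lines.
# These values are illustrative only and are not taken from DepMap.
gene_effect = {
    "LINE_A": -1.42,
    "LINE_B": -0.95,
    "LINE_C": -0.10,
    "LINE_D": 0.08,
    "LINE_E": -1.05,
}

# Assumed cut-off: -1, the median score of common essential genes.
STRONG_DEPENDENCY_CUTOFF = -1.0


def classify(score, cutoff=STRONG_DEPENDENCY_CUTOFF):
    """Label a single gene effect score using the scale described in the text."""
    if score <= cutoff:
        return "strongly dependent"
    if score < 0:
        return "reduced viability"
    return "no loss of viability"


labels = {line: classify(score) for line, score in gene_effect.items()}
dependent_lines = sorted(
    line for line, label in labels.items() if label == "strongly dependent"
)

print(labels)
print("Strongly dependent:", dependent_lines)
print("Median gene effect:", median(gene_effect.values()))
```

In practice, the scores would come from a file downloaded from the portal, and the cut-off should be chosen with the distribution for the gene of interest in mind.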
In addition to target dependency information, DepMap also houses genomic and metabolomic characterization from the Cancer Cell Line Encyclopedia (https://sites.broadinstitute.org/ccle/) (16-19). This includes mRNA expression, copy number and mutation status data, with recent efforts being made to expand this characterization to include mass spectrometry-based quantification of the proteome (20). Integration of this comprehensive omics dataset with extensive genetic and pharmacological screening gives the statistical power to make predictions to inform clinical positioning. This is exemplified by the DepMap output of dependent cell lines for the well-validated anti-cancer drug target KRAS (Figure 2A). It is well established that gain-of-function mutations in KRAS are prevalent in pancreatic, colorectal and non-small cell lung cancer (21) and act as a driver of tumour progression. The DepMap output for KRAS demonstrates that dependency on KRAS is strongly selective, with a clear 'tail' of strongly dependent cell lines (gene effect score ≤−1) observed with both RNAi and CRISPR screening (Figure 2A). Analysis of this dataset demonstrates that there is a statistical enrichment of KRAS dependency in cell lines of a pancreatic and colorectal origin identified from the CRISPR target dependency screens (Figure 2B), consistent with the high prevalence of gain-of-function mutations in these indications. Further evidence of the power of DepMap to identify selective dependencies on a target based on genomic features of cell lines comes from the observation that the WRN DNA helicase was identified as a selective vulnerability for microsatellite instable (MSI) cancer cell lines using CRISPR and RNAi screening data, now housed within DepMap (22). As a result, there is now significant commercial interest in developing selective WRN small molecule inhibitors for use in MSI tumours.
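One common way to test whether dependent cell lines are over-represented in a lineage is a one-sided Fisher's exact test, sketched below using only the Python standard library. This is an assumption about how such an enrichment could be assessed, not a description of the statistic computed by the DepMap portal, and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(dep_in_lineage, lineage_size, dep_total, panel_size):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Returns the probability of observing at least `dep_in_lineage` dependent
    cell lines in a lineage of `lineage_size` lines if `dep_total` dependent
    lines were distributed at random across a panel of `panel_size` lines.
    """
    max_overlap = min(lineage_size, dep_total)
    denominator = comb(panel_size, lineage_size)
    tail = sum(
        comb(dep_total, k) * comb(panel_size - dep_total, lineage_size - k)
        for k in range(dep_in_lineage, max_overlap + 1)
    )
    return tail / denominator


# Hypothetical counts: 1000 screened lines, 100 dependent on the gene,
# 40 lines from the lineage of interest, 20 of which are dependent.
p_enriched = enrichment_p_value(
    dep_in_lineage=20, lineage_size=40, dep_total=100, panel_size=1000
)
print(f"Enrichment p-value: {p_enriched:.3g}")
```

When many lineages are tested for the same gene, the resulting p-values would need correction for multiple testing before being interpreted.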
Cell Model Passports (https://cellmodelpassports.sanger.ac.uk/) is an integrated component of the DepMap that offers its own user-friendly portal with genomic and clinical characterization of >2000 cancer cell line models (23). The portal allows the user to search by cell line to identify all available information, including presence of driver mutations, mRNA and protein expression, copy number alterations and sensitivity to drugs tested by the Genomics of Drug Sensitivity in Cancer initiative (https://www.cancerrxgene.org/) (24). Integration within DepMap allows correlations between cell features and target dependency to be identified.
Synthetic lethality, where co-deletion/inhibition of two targets results in a loss of viability that is not observed upon perturbation of one target alone, is a strategy that offers the opportunity to achieve a therapeutic window due to limited effects on cells not bearing the cancer-specific alteration. The clinical relevance of this concept has been illustrated by the success of PARP inhibitors in BRCA1/2 mutant settings (25). There have been recent efforts to generate databases that can predict synthetic lethal interactions, such as SynLethDB (26), based on compiling synthetic lethal CRISPR screening data from literature. The power of these databases to identify novel targets is dependent on the quality and quantity of the compiled data and is currently limited by the scale of available synthetic lethality screening studies. In the coming years, we expect that these databases will have the potential to further inform clinical positioning strategies and identify novel targets that may only be revealed in a synthetic lethal setting.
Despite the power of cancer cell line databases, there remain several key limitations that should be considered when interpreting outputs. The effects of depleting or deleting a target are not necessarily recapitulated in full by inhibition of target activity, which is currently the most common modality of novel anti-cancer agents. Therefore, deletion/depletion data should be interpreted with caution, especially for targets that have multiple activities, as exemplified by the several known kinase-independent functions of protein and lipid kinases (27). Another key limitation is that cell screening data housed in DepMap have been generated under standard cell culture media conditions. It has become apparent that the composition of standard cell culture media does not accurately recapitulate the in vivo environment and that utilizing physiologically relevant cell culture media can alter the metabolic profile of cancer cell lines in 2D and 3D growth conditions (28,29). Indeed, CRISPR screens performed in haemopoietic cell lines under physiological culture conditions result in the identification of specific gene dependencies not revealed in conventional culture conditions (30). This demonstrates that DepMap outputs will not always accurately reflect dependencies of cancer cell lines for certain targets, particularly those that play a role in metabolic processes, where dependency may be influenced by gene-nutrient interaction.
Cancer cell line growth in 3D conditions is believed to more accurately recapitulate the in vivo environment (31,32) and CRISPR screens comparing gene dependencies in 2D versus 3D conditions have identified 3D culture-specific vulnerabilities (33). Known cancer drivers and genes found to be heavily mutated in cancer were more likely to be identified by screening under 3D conditions, suggesting that there may be reduced target attrition at later stages of the drug discovery process if 3D culture were used for novel target identification screens. A clear gap therefore exists for a database that houses large-scale CRISPR screening performed using 3D growth and physiologically relevant culture conditions.
Another key limitation of screening data from cancer cell lines is that using cancer cell lines alone is not an accurate reflection of the complex situation in vivo, where the tumour microenvironment, consisting of a range of cell types including stromal and immune cells, can play a key role in outcomes of therapeutic interventions. Lack of cancer cell intrinsic dependency on a target in this class would not therefore be indicative of response to perturbation in vivo. This is exemplified by zero cell lines being identified in DepMap CRISPR or RNAi datasets as being dependent setting, where they act by enhancing the T-cell-mediated anti-tumour immune response (34). A growing body of research highlights the role of the immune system in cancer initiation, progression, treatment and resistance to therapy (35). The interaction between malignant and immune cells is therefore an important consideration and several anti-cancer agents act by directly regulating the tumour immune response. Genome-wide CRISPR screens have been performed in immune cells such as macrophages and T cells (34-36) to identify factors required for cancer-relevant phenotypes, such as proliferation, viability and exhaustion. A database that compiled this information to cross-reference with cancer cell intrinsic dependency databases would be a step towards greater understanding of the impact of target inhibition in a more complex setting.
Once a target has been identified, database mining can also be used to identify interactors of the target of interest. Identifying known interactors can provide further mechanistic insight and lead to additional therapeutic opportunities, such as small molecule-mediated disruption of disease-relevant protein-protein interactions (PPIs). It is also possible that more tractable targets that play a similar role in disease progression can be identified through such efforts. The STRING database (https://string-db.org/) collates PPIs from several sources, including literature mining, computational prediction and databases of interactions identified experimentally (36). Several other PPI databases exist, and a thorough comparison of coverage has been reported (37), demonstrating that STRING has the highest coverage of experimentally verified PPIs and is updated frequently.

Immune cells
The immune oncology field has been driven forward in recent years by the approval of therapeutic antibodies targeting CTLA4, PD-1 and PD-L1 (38). The success of such immune checkpoint inhibitors encouraged development of therapeutic agents against additional targets. However, the outcomes of clinical trials for novel immunotherapies have been underwhelming (39), highlighting the complexity of the tumour microenvironment and the need to better understand the cross-talk between tumour cells and infiltrating components of the immune system.
Understanding the role a target plays in regulating the tumour immune response is key for all aspects of target evaluation. A particular immune cell may be the most appropriate model in which to measure target activity and dependency. Clinical positioning may be more accurately guided by tumour extrinsic effects, and a role in normal immune functions will inform tolerability assessments. It is therefore recommended that the databases highlighted in this section are utilized as part of a comprehensive target evaluation.
Gene expression data are the most abundant omics data available for both patient-derived samples and cancer cell lines and are a commonly utilized resource for researchers looking to understand genotype-phenotype relationships. The Cancer Genome Atlas (TCGA), for example, has gene expression data from >10 000 patient tumour samples (see the 'Clinical' section). However, these data are derived from bulk RNA sequencing (RNA-seq) that does not differentiate between the various cell types, such as immune and tumour cells, present in a sample. To address this issue, various groups have derived deconvolution methods to give estimates of immune infiltration from bulk RNA-seq data, including Cibersort (40), Estimate (41), xCell (41), EPIC (42), quanTIseq (43), mMCP-counter (44) and TIMER (45).
The Tumor IMmune Estimation Resource (TIMER) (http://timer.cistrome.org/), developed by the Liu lab at the Dana-Farber Cancer Institute, provides a portal that allows researchers to explore the infiltration of immune cells in TCGA tumour samples and correlate this with gene expression, mutation and copy number alterations. Infiltration scores calculated with the aforementioned deconvolution algorithms are available. However, cross-comparison of output from the different deconvolution methods is important as no single method is completely accurate and in some cases the infiltration scores can even be conflicting. Such data should therefore be used primarily to guide further experimental validation. TIMER also provides the functionality to determine the association between gene expression, immune infiltration and clinical outcome. While the data for TCGA samples are directly available via the TIMER portal, there is also the option for researchers to upload custom bulk RNA-seq data to analyse data generated elsewhere.
TISIDB (http://cis.hku.hk/TISIDB/) predicts response to immunotherapy by integrating datasets for a given target from multiple sources, including gene expression data, high-throughput CRISPR/shRNA screening to determine sensitivity to T-cell-mediated killing and literature mining of several thousand publications (46). When querying a target of interest, TISIDB presents associated information in a multi-tab format, which includes text mining results for evidence in the literature that links the target to immunotherapy response. The output also comprises omics data from clinical samples to identify correlations between pre-immunotherapy treatment target expression and immune-relevant cancer subtype or between target expression, mutation, copy number or methylation status with response to immune checkpoint inhibitors, abundance of tumour infiltrating lymphocyte subtypes, expression of known immunomodulators or chemokine expression. Like TIMER, TISIDB provides correlation data as a heatmap that is easy to visualize and interpret. For example, a heatmap is used to visualize the correlation of PDCD1 (encodes PD-1 protein) expression and known immunomodulators across a range of human cancers (Figure 2C). Red squares indicate positive correlation between PDCD1 expression and expression of named immunomodulators, which for a novel target could provide strong evidence of a potential link to the tumour immune response, to be followed up experimentally. This dataset can be further probed by visualization of correlations with specific immunomodulators within a cancer indication of interest, as illustrated by the correlation of PDCD1 and CD27 expression in lung adenocarcinoma (Figure 2D), as has been previously shown for breast cancer (47). TISIDB also provides the ability, within the 'Immunotherapy' tab, to identify target gene expression differences between responders and non-responders to immune checkpoint inhibition from patient samples. However, low patient numbers are a general feature of studies housed within TISIDB, which limits the ability to identify statistically significant differences. Overall, TISIDB provides a comprehensive overview that supports detailed evaluation of the potential role of a novel target in the cancer immune response.
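Correlations of this kind between the expression of two genes can also be reproduced on downloaded expression values. The sketch below computes a rank-based (Spearman) coefficient with the Python standard library. The choice of Spearman correlation is an assumption made for illustration and should be checked against the statistic reported by the portal for a given view. The expression values are hypothetical and are not taken from TCGA.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked values."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log-scale expression of two genes in six tumour samples.
pdcd1_expression = [0.5, 1.2, 2.8, 3.1, 4.0, 5.5]
cd27_expression = [1.0, 1.1, 2.0, 3.5, 3.0, 6.2]

rho = spearman(pdcd1_expression, cd27_expression)
print(f"Spearman rho: {rho:.3f}")
```

A coefficient from so few samples carries little statistical weight. The same caveat on sample size applies to the real datasets.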
TIMER and TISIDB have been utilized effectively as tools to identify correlations between target expression (48,49) or DNA methylation status (50) and tumour immune cell infiltration and prognosis. This demonstrates the power of such correlative analysis to identify biomarkers or novel targets that regulate the tumour immune response.
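Because infiltration estimates from different deconvolution methods can conflict, it can be useful to check whether a correlation points in the same direction across methods before following it up. The following minimal Python sketch does this for one gene and one immune cell type. The method names are placeholders, the coefficients are hypothetical and the minimum absolute correlation of 0.1 is an arbitrary assumption.

```python
# Hypothetical correlation coefficients between expression of a target gene and
# the estimated infiltration of one immune cell type, one value per method.
# Method names are placeholders rather than real tool outputs.
correlations = {
    "method_A": 0.41,
    "method_B": 0.35,
    "method_C": -0.22,
    "method_D": 0.03,
}


def summarize_direction(correlations, min_abs=0.1):
    """Group methods by the sign of their correlation.

    Coefficients with an absolute value below `min_abs` (an assumed cut-off)
    are treated as uninformative and ignored.
    """
    positive = sorted(m for m, r in correlations.items() if r >= min_abs)
    negative = sorted(m for m, r in correlations.items() if r <= -min_abs)
    return {
        "positive": positive,
        "negative": negative,
        "conflicting": bool(positive) and bool(negative),
    }


summary = summarize_direction(correlations)
print(summary)
```

A conflicting result of this kind is a prompt for experimental validation, not a reason to favour one method over another.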
The gene expression datasets used by TIMER and TISIDB are primarily from bulk RNA-seq experiments, and the deconvolution methods employed only estimate the heterogeneity of the cell types present within the sample. This has led to a shift towards single-cell RNA-seq (scRNA-seq) to give accurate expression profiles of individual cells. Until recently, scRNA-seq datasets have been notoriously difficult to access and analyse without considerable programming knowledge. Although publications require data to be deposited in repositories such as Gene Expression Omnibus (51) and the Database of Genotypes and Phenotypes (52), there is little consistency in formatting or even at what stage of processing the data are deposited, which complicates data extraction and analysis. Several groups have produced online portals to facilitate easier access to scRNA-seq datasets, including the Single Cell Expression Atlas (https://www.ebi.ac.uk/gxa/sc/home) by EMBL-EBI (53) and CellxGene (https://cellxgene.cziscience.com/) from the Chan Zuckerberg Biohub project (54). The datasets provided by Single Cell Expression Atlas are limited, with only 131 human and 111 mouse experiments available at the time of writing. The tool allows expression of a gene of interest to be projected on to the UMAP (55) or t-distributed stochastic neighbour embedding (56) plots (the two main graphical methods for scRNA-seq data visualization) of a chosen dataset. However, metadata annotations can be variable, with some datasets critically missing cell type assignments. CellxGene houses a much larger collection of datasets with >600 human studies currently available. The explore function provides extensive options for annotating the UMAP of each dataset with study parameters and author cell assignments but also importantly includes quality control metrics such as mitochondrial fraction and unique molecular identifier counts for users to confirm data quality. As with other tools, specific genes can be plotted on the UMAP, but CellxGene also offers a gene set function that allows users to plot multiple genes together, which is useful for exploring gene signatures.
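The quality control metrics mentioned above can be used to exclude low-quality cells before expression of a target is interpreted. The sketch below applies two simple cut-offs in plain Python. The field names, cell records and thresholds are all hypothetical. Appropriate thresholds depend on tissue and protocol and are not prescribed by any of the portals discussed.

```python
# Hypothetical per-cell quality control metrics. Field names and values are
# illustrative and do not correspond to a specific portal export format.
cells = [
    {"cell_id": "cell_1", "umi_count": 5200, "mito_fraction": 0.04},
    {"cell_id": "cell_2", "umi_count": 310, "mito_fraction": 0.03},
    {"cell_id": "cell_3", "umi_count": 4100, "mito_fraction": 0.27},
    {"cell_id": "cell_4", "umi_count": 8900, "mito_fraction": 0.08},
]

MIN_UMI_COUNT = 500       # assumed cut-off: very low counts suggest empty droplets
MAX_MITO_FRACTION = 0.20  # assumed cut-off: high values suggest stressed or dying cells


def passes_qc(cell, min_umi=MIN_UMI_COUNT, max_mito=MAX_MITO_FRACTION):
    """Return True if the cell meets both quality control cut-offs."""
    return cell["umi_count"] >= min_umi and cell["mito_fraction"] <= max_mito


kept = [cell["cell_id"] for cell in cells if passes_qc(cell)]
removed = [cell["cell_id"] for cell in cells if not passes_qc(cell)]

print("Kept:", kept)
print("Removed:", removed)
```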
NAR Cancer, 2023, Vol. 5, No. 3
DISCO contains a large collection of human scRNA-seq datasets integrated into tissue-specific 'atlases'. Each atlas can be queried by gene ID with outputs given as UMAPs or violin plots by cell type. Datasets can also be queried in a pan-tissue manner, which can be informative when evaluating a target for which the appropriate tissue context may not yet be known. DISCO allows visualization of target expression in immune cell types across all atlases as a violin plot to give an overview of distribution of target expression. Exploring the different atlases leads to the interactive UMAP tool that can be used to view the expression of a target gene at a cellular level, as shown for PDCD1 in a kidney atlas (Figure 2E) (58). These data can inform the selection of appropriate immune cell types in which to study target biology and direct clinical positioning efforts. There are several disease-specific atlases available, including for pancreatic ductal adenocarcinoma, ovarian cancer and triple negative breast cancer, which have the added functionality of being able to perform disease versus normal comparison for a gene of interest. Expansion of cancer-specific atlases in the coming years will further enhance the utility of DISCO for cancer target evaluation. DISCO also offers a 'FastIntegration' tool that allows specific datasets of interest to be integrated and analysed together. Such capabilities have until now not been possible without programming knowledge and use of tools such as the R package Seurat. Data from DISCO can also be downloaded as a Seurat object to facilitate more complex analysis if required.

Mouse models
An integral component of a target evaluation is to determine whether there is sufficient validation of target dependency via genetic or pharmacological methods in preclinical models that accurately model the complex nature of cancer in the physiological setting. This requires understanding the impact of target modulation in a model that includes a proliferating tumour and components of the tumour microenvironment and that represents the inherent heterogeneity of human disease. Demonstration of efficacy in clinically relevant models is also used to guide clinical positioning strategies. Another key aim of target evaluation is to predict toxicity associated with target inhibition, using genetic or pharmacological methods. Mouse models have been used extensively for each of these aims and are routinely used to predict compound efficacy and toxicity in a clinical setting. Several databases compiling mouse datasets across multiple models provide different platforms that can be mined both to obtain available information and to identify suitable models that can be used experimentally.
The International Mouse Phenotyping Consortium (IMPC) (https://www.mousephenotype.org) has been established between 21 research institutes with the aim of creating murine knockouts for every protein-coding gene within the mouse genome (59,60). Knockout generation and characterization are all performed within the consortium. Standard pipelines for phenotypic characterization are applied, enabling valid comparison between all knockout models. If homozygous knockout mice are viable, extensive phenotyping of the early adult takes place between 9 and 15 weeks. If they are not viable, heterozygotes are characterized and the stage of embryonic lethality of the homozygotes is determined. Querying IMPC by gene name will bring up a link, if available, to a graphical representation of the overall phenotypic characterization where 20 different phenotypic outputs are coloured to represent significant differences to wild type, no significant difference or not tested. The full details of the phenotypic characterization and an analysis of body weight can also be found within this page. Such phenotypic characterization data can be used to flag potential tolerability concerns associated with loss of a target protein that may be previously unknown. Since unmanageable toxicity is a key reason for attrition of targets during the drug discovery and clinical development process, it is essential that potential liabilities are flagged as part of a target evaluation. Knockout mice, embryonic stem cells or targeting vectors can also be purchased directly through this portal.
It is, however, important to note that knockout mouse data are a rather crude way to evaluate potential toxicity liabilities of a therapeutic, such as a small molecule inhibitor. Many proteins have multiple functions and target knockout will ablate all functions, resulting in a phenotype that may not be representative of therapeutic intervention. Key parameters for toxicity are the pharmacokinetic (PK) and pharmacodynamic (PD) properties of a therapeutic agent that are also not reflected by whole body or tissue-specific target knockout. Toxicity may also be driven by off-target effects of a therapeutic agent, which will not be predicted by target-directed evaluation.
The Mouse Models of Human Cancer Database (MMHCdb) (http://tumor.informatics.jax.org/) is a manually curated resource of several types of murine cancer models hosted by The Jackson Laboratory with funding from the National Cancer Institute (NCI) (61). MMHCdb is part of the Mouse Genome Informatics consortium and was first released in 1998 as the Mouse Tumor Biology Database (62). The database houses data from over 46 000 models from nearly 7000 different cohorts of mice. Extensive efforts have been made to provide curated, consistent data to inform selection of clinically relevant models. Data are extracted from literature or submitted directly by individuals or large-scale research initiatives. The database bridges the historical gap around gene and strain nomenclature standards from diverse sources.
The three types of mouse models with available information within MMHCdb are inbred mouse models, genetically engineered mouse models (GEMMs) and patient-derived xenografts (PDXs). Information available for inbred mouse models includes an interactive graphical summary of the characteristic cancers observed in over 700 different inbred mouse strains. The Tumor Frequency Grid tool displays the frequency of spontaneous tumours across the different inbred strains. GEMMs are generated by introduction of murine equivalents of human cancer-associated mutations and can be used to study tumour initiation, progression and response to therapy (63). However, as illustrated by strain comparison using the Tumor Frequency Grid, the genetic background in which GEMMs are developed can have an impact on the observed phenotype. It is therefore essential that the influence of genetic backgrounds is taken into consideration when selecting appropriate models for transplantation or GEMM generation or when interpreting study data. To support model selection, the MMHCdb search function allows queries by gene type, cancer type and mouse strain to identify all associated studies and provide further information on tumour onset, pathology and sites of metastasis.
PDX models are generated by implantation of human tumour tissue in an immunodeficient or humanized mouse. In collaboration with EMBL-EBI, MMHCdb co-developed the PDX Finder resource that serves as a global catalogue of PDX models (64). The PDX Finder tool has since grown to include cancer cell line and organoid models and is now available as a stand-alone database, Patient Derived Cancer Models Finder (https://www.cancermodels.org), that can be used to identify clinically relevant patient-derived model systems for a given disease area and explore associated characterization. However, the majority of PDX models available within the MMHCdb are from the immunodeficient NSG host strain and therefore will not provide insight into potential interactions with the tumour immune response.
Syngeneic mouse models, where murine tissue or cell lines are transplanted into immune competent mouse models, allow the study of the tumour immune response in a complex setting. The Tumor Immune Syngeneic MOuse (TISMO) (http://tismo.cistrome.org/) database hosts datasets from 137 public syngeneic mouse model studies, comprising over 1500 samples from 68 different models (65). These models, however, do not cover all cancer indications, with brain cancers, for example, having no representative models within TISMO. In addition to manually curated model characterization, including details of cell line genotype and cancer type, mouse genetic background and implantation site, TISMO provides interactive visual interfaces to explore datasets for gene expression, immune cell infiltrate and response to therapy, in both treatment-naïve and immune checkpoint blockade-treated models.
There are several specific features of TISMO worth highlighting. The first is the 'Pathway' tab that allows comparison between different biological pathways, from KEGG, GO cellular compartment, WikiPathways, GO molecular function, GO biological process, Reactome and MSigDB C7 immunologic signature, across different tumour models and between pre- and post-treatment with immune checkpoint inhibitors. These data can provide evidence that a target of interest plays a role in the response to specific therapies in mouse tumour models with specific genetic backgrounds. TISMO also allows upload of user gene sets for custom analysis, which is a powerful feature when evaluating the role of a novel target in the tumour immune response. Within the 'Infiltrate' tab, users can compare immune cell infiltration levels across different tumour models, between pre- and post-treatment, and between immune checkpoint inhibitor responders and non-responders. As discussed in the immune cell profiling section, immune cell infiltrates are not measured directly but rather estimated using deconvolution algorithms and therefore the data output should be used to guide further experimental validation. The main drawback compared to human databases is the limited number of immune cell types and signatures available to be assessed. TISMO currently only allows analysis of CD8+ T-cell, CD4+ T-cell, macrophage, dendritic cell, B-cell and neutrophil infiltrates.
Syngeneic mouse models are extensively used in immune oncology studies, generating an ever-expanding volume of gene expression, immune infiltration and treatment response data. The field has suffered from a lack of systematic collection and variation between analysis methods, which is being addressed by databases such as TISMO. TISMO is currently the only database with a comprehensive collection of datasets from syngeneic mouse tumour models. This database has also recently been used to support machine learning on syngeneic mouse tumour profiles to model clinical immunotherapy responses (66). Additional features that would enhance the utility of TISMO would be to allow the upload of proprietary databases for analysis, enable correlation assessments between gene expression profiles and immune infiltration levels, and include available scRNA-seq datasets.

Identification and profiling of tool compounds
Small molecule chemical probes are used in drug discovery as an orthogonal approach to genetic techniques in cellular and animal models in order to predict efficacy, assess target-related toxicity and explore target biology (Figure 1) (67). These tool compounds are usually small molecule inhibitors, but can be receptor antagonists, receptor agonists or other modulators, such as proteolysis-targeting chimeras (PROTACs; see below). Biological agents such as therapeutic antibodies can also be used for similar aims, but for the purposes of this review we focus on pharmacological agents.
The use of non-selective tool compounds that are unsuitable for biological studies is common and resulting data can misinform target evaluation. An example of a compound that is still widely used is the non-selective PI3 kinase inhibitor LY294002 (68). Online resources have therefore been developed to help researchers select and use the best tool compounds for their studies (69). Such information can also be used to evaluate how informative existing literature or screening data utilizing compounds may be.
Probe Miner (https://probeminer.icr.ac.uk) was developed by the Institute of Cancer Research to objectively identify the most suitable tool compounds (70). It uses chemical and bioactivity data from large-scale public databases such as BindingDB (https://www.bindingdb.org/) and ChEMBL (https://www.ebi.ac.uk/chembl/) to assess over 1.8 million compounds (70-73). Probe Miner integrates fitness scores for cellular potency, target selectivity, permeability, structure-activity relationships, inactive analogues and pan-assay interference to automatically rank compounds for a particular protein target (70). Probe Miner displays a distribution of the top 20 rated chemical probes together with a compound viewer containing the chemical structure and a radar plot highlighting the strengths and limitations of each chemical tool. A direct link is also provided to common probes recommended by the Chemical Probes Portal so that the user can access guidance for best use of these reviewed compounds. Moreover, Probe Miner has identified high-quality compounds that have been prioritized for future appraisal by the Chemical Probes Portal (see below). Compounds that do not meet the minimum requirements (potency <100 nM; selectivity >10-fold against any other protein; and permeability, effects in cells at <10 µM) are flagged with a recommendation to use with caution or avoid when better compounds are available.
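The minimum requirements quoted above can be expressed as a simple filter. The following is a minimal sketch only: the `Compound` record, its field names and the example values are hypothetical and do not reflect Probe Miner's own data model or scoring code; only the three thresholds are taken from the text.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the minimum requirements quoted in the text.
MAX_POTENCY_NM = 100.0        # potency must be below 100 nM
MIN_SELECTIVITY_FOLD = 10.0   # more than 10-fold selective over any other protein
MAX_CELL_ACTIVITY_UM = 10.0   # effects in cells must be seen below 10 µM


@dataclass
class Compound:
    """Hypothetical record; field names are illustrative assumptions."""
    name: str
    potency_nm: float
    selectivity_fold: float
    cell_activity_um: float


def failed_criteria(compound: Compound) -> List[str]:
    """Return the names of the minimum requirements the compound fails."""
    failures = []
    if compound.potency_nm >= MAX_POTENCY_NM:
        failures.append("potency")
    if compound.selectivity_fold <= MIN_SELECTIVITY_FOLD:
        failures.append("selectivity")
    if compound.cell_activity_um >= MAX_CELL_ACTIVITY_UM:
        failures.append("cell activity")
    return failures


def recommendation(compound: Compound) -> str:
    """Map the failed criteria onto the flag described in the text."""
    if failed_criteria(compound):
        return "use with caution or avoid when better compounds are available"
    return "meets minimum requirements"


if __name__ == "__main__":
    # Made-up values for two hypothetical compounds.
    examples = [
        Compound("compound_A", potency_nm=20.0, selectivity_fold=50.0, cell_activity_um=0.5),
        Compound("compound_B", potency_nm=1400.0, selectivity_fold=2.0, cell_activity_um=20.0),
    ]
    for c in examples:
        print(c.name, "->", recommendation(c), failed_criteria(c))
```

Returning the list of failed criteria, rather than a single pass/fail flag, mirrors the idea of highlighting the specific strengths and limitations of each chemical tool.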
The Chemical Probes Portal (https://www.chemicalprobes.org) is a manually curated online resource for selecting tool compounds (67, 69). It currently contains over 500 compounds that encompass >400 protein targets from 100 protein families. These compounds have been evaluated by chemical probe experts, who provide recommendations on the best available compounds, together with guidance on concentrations and conditions for use in cellular assays and in vivo models. Where available, the portal will highlight any inactive compound analogues and orthogonal compounds that can be used to confirm that observed phenotypes are target engagement dependent. The portal contains links to primary literature references, vendor websites and gene databases, and highlights flawed or outdated 'historical compounds' that researchers should avoid. For example, LY294002 is described in the Chemical Probes Portal as a 'historical compound, not to be used as a selective chemical probe for a specific target'.
Additional databases that can be used to access molecular information for approved drugs and investigational compounds are DrugBank (https://go.drugbank.com/) (74) and the Structural Genomics Consortium (SGC) (https://www.thesgc.org/chemical-probes) (75). Searching DrugBank by target links to known agents with activity towards that target, while searching by drug links to a wealth of information for that specific agent. This includes chemical structure, pharmacology, known drug-drug interactions, chemical properties and links to references for further information. DrugBank is freely available to use for non-commercial applications, but commercial use requires a licence. The SGC is a global partnership between academia, industry and funding agencies with one of the key aims being to create and characterize chemical probes that are made freely available with no restrictions on use. The SGC portal lists available chemical probes, sorted by target protein class. Clicking on the probe of interest links to all available information on probe properties, recommends a chemically similar negative control probe and provides a link to request the probe(s) of interest. Increased awareness and use of databases such as Probe Miner, the Chemical Probes Portal, DrugBank and the SGC will reduce the use of unsuitable tool compounds that can misinform target evaluation and promote the best practice of utilizing chemically similar negative control compounds to confirm on-target phenotypes.
PROTACs have become an increasingly attractive strategy for targeting 'undruggable' proteins (76, 77). PROTAC molecules can exploit all surface binding sites and are not reliant on binding in a deep hydrophobic pocket or active sites to modulate target protein activity (78). PROTACs are heterobifunctional molecules that contain a warhead small molecule that binds the protein of interest, connected via a linker molecule to a small molecule E3 ligand that recruits an E3 ubiquitin ligase to degrade the bound target protein. PROTAC-DB (http://cadd.zju.edu.cn/protacdb/) is an online public database that collates currently described PROTAC molecules (79) and can be queried by target, compound name/ID or chemical structure. Output is presented as a datasheet showing 2D compound structures (divided into warhead, linker and E3 ligand), biological activities (degradation capacity, binding affinities and cellular activities) and calculated physicochemical properties. The database also utilizes a computational method (PROTAC-Model) to generate predicted ternary complex models for PROTACs that exhibit good degradation capacity (80). PROTACs can be a useful research tool to provide efficacy, tolerability or clinical positioning information. New modalities such as degrader approaches can also influence tractability assessments and allow targets previously deemed to have poor tractability to be revisited.
Targeted cancer therapies act by perturbing specific molecular pathways in tumours. However, analysis of the genomes from specific tumour types has shown that tumours are highly heterogeneous and that this heterogeneity can often explain varied patient responses to targeted therapies. This broad genetic heterogeneity is also observed across cancer cell lines (16). The Broad Institute and Harvard University have developed a novel screening technology called Profiling Relative Inhibition Simultaneously in Mixtures (PRISM) that enables simultaneous high-throughput drug screening in large panels of genetically characterized cell lines (81). This method allows pooled screening of cell lines labelled with unique DNA barcode sequences. Barcode abundance is used to generate cell line sensitivity signatures by comparing treatment to control conditions. Predictive models can identify molecular features that correlate with PRISM sensitivity profiles. The PRISM repurposing dataset (https://depmap.org/repurposing) is available on the DepMap portal and contains viability data generated using the PRISM multiplexed cell line assay to screen 578 cell lines with the Broad Repurposing Library (4518 compounds) (82). Approximately three quarters of the library compounds are approved clinical compounds or in clinical development, with the remainder consisting of tool compounds. This repurposing screen identified tepoxalin, a dual cyclooxygenase and lipoxygenase inhibitor, that selectively killed cell lines with elevated expression of the multidrug resistance protein, MDR1 (82). Expanding the PRISM drug repurposing resource to cover more compounds and cellular models would support repurposing of existing drugs into future cancer therapies. The output can also inform clinical positioning strategies for novel targets by identifying cell line features associated with sensitivity to tool compounds.
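The comparison of barcode abundance between treatment and control conditions can be illustrated with a log2 fold change per cell line. This is a simplified sketch under stated assumptions, not the published PRISM analysis pipeline: the pseudocount, the function names, the cell line labels and the read counts are all hypothetical, and real data would first need normalization and replicate handling.

```python
import math
from typing import Dict


def log2_fold_change(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of treated to control barcode counts.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a barcode is completely depleted.
    """
    return math.log2((treated + pseudocount) / (control + pseudocount))


def sensitivity_profile(treated_counts: Dict[str, float],
                        control_counts: Dict[str, float]) -> Dict[str, float]:
    """Return a log2 fold change for every cell line present in the control pool.

    Strongly negative values indicate barcodes depleted by treatment,
    i.e. cell lines that appear sensitive to the compound.
    """
    profile = {}
    for cell_line, control in control_counts.items():
        treated = treated_counts.get(cell_line, 0.0)
        profile[cell_line] = log2_fold_change(treated, control)
    return profile


if __name__ == "__main__":
    # Hypothetical barcode read counts for three pooled cell lines.
    control = {"line_1": 999.0, "line_2": 999.0, "line_3": 499.0}
    treated = {"line_1": 249.0, "line_2": 999.0, "line_3": 499.0}
    for name, value in sensitivity_profile(treated, control).items():
        print(f"{name}: log2FC = {value:.2f}")
```

A vector of such values across all cell lines is the kind of sensitivity signature that can then be correlated with molecular features of the cell lines.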

Clinical
A key aim of target evaluation is to identify a clinical positioning strategy for a compound developed against a specific target. Patient omics profiling information, including gene mutation or target expression profiles, can be correlated with disease-relevant outcome(s) such as patient survival to refine this clinical positioning strategy, with the aim of delivering precision medicine. A clinical positioning strategy can also inform the selection of relevant model systems for efficacy predictions or compound testing. In this section, we describe key databases that house patient omics and survival data and discuss how these can be utilized.
The combination of cost-effective next-generation sequencing together with large-scale cancer genomic efforts, such as TCGA and the International Cancer Genome Consortium (ICGC), meant that online platforms were needed to integrate the ever-increasing datasets generated and make them readily accessible to the wider cancer research community. TCGA was initiated in 2006 as a joint effort between the NCI and the National Human Genome Research Institute to create a comprehensive 'atlas' of cancer genomic profiles by cataloguing cancer-causing genome alterations in the most prevalent human tumour types (83). During the subsequent 16 years, the initiative has generated multi-omics data, including gene expression, DNA mutation, copy number variant and DNA methylation, from over 20 000 primary cancer and matched normal samples across 33 cancer types (84). TCGA data can be accessed through the Genomic Data Commons data portal (https://portal.gdc.cancer.gov/). The portal provides different navigation options for browsing available datasets to view summaries of data for each project, explore data at the case, gene and mutation levels, or compare different cohorts or clinical variables of a specific cohort. One limitation for cancer versus normal comparisons from the TCGA is that the number of samples from adjacent normal tissue is often far lower than that for tumour samples, which reduces the statistical power of the analysis. An alternative non-tumour gene expression resource that can be used for comparison purposes is the Genotype-Tissue Expression project (https://gtexportal.org/home/) that has gene expression data for 54 normal tissue types from close to 1000 individuals.
The cBio Cancer Genomics Portal (https://www.cbioportal.org/) was developed at the Memorial Sloan-Kettering Cancer Center to enable the visualization and in-depth analysis of multi-omics patient data for various types of cancer (85, 86). It houses data from the entire TCGA Pan-Cancer Atlas, comprising over 10 000 samples (87) as well as additional data from over 200 published studies with almost 70 000 patient samples that have been curated to ensure there is no redundancy between the studies. Users can query selected cancer studies to visualize the available omics data for single or multiple genes across patient samples. For example, querying the TCGA lung adenocarcinoma Pan-Cancer Atlas study for omics data pertaining to RAS pathway members generates a series of reports, including a summary of genomic alterations (Figure 3A). These data demonstrate that KRAS, HRAS, NRAS and BRAF gene alterations are mutually exclusive, pointing to the shared functional relationship between pathway members. This can then be explored further using the 'Pathways' tab that provides a schematic of signalling pathway(s) and functionally linked proteins to the user's target query, together with details of alteration frequency for all related targets (Figure 3B). Mutual exclusivity of mutations in cancer can be used to identify vulnerabilities that can be exploited therapeutically, such as the observation that cyclin E1 amplification is mutually exclusive with BRCA1 mutation in high-grade serous ovarian cancers and that BRCA1 is selectively required for survival of cyclin E1 amplified cells (88). Such studies demonstrate the power of this analysis to identify clinical positioning opportunities.
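One common way to quantify mutual exclusivity between two genes is a one-sided Fisher's exact test on the 2 x 2 table of altered and unaltered samples. The sketch below uses only the Python standard library; the function name and the sample counts are hypothetical, and it is offered as an illustration of the statistic rather than as a description of how any particular portal computes its values.

```python
from math import comb


def mutual_exclusivity_p(both: int, only_a: int, only_b: int, neither: int) -> float:
    """One-sided Fisher's exact test for mutual exclusivity.

    Returns the probability, under independence and with fixed margins, of
    seeing `both` or fewer samples in which gene A and gene B are co-altered.
    Small values suggest the two alterations co-occur less often than expected.
    """
    total = both + only_a + only_b + neither
    altered_a = both + only_a
    altered_b = both + only_b
    # Smallest number of co-altered samples compatible with the margins.
    lowest = max(0, altered_b - (total - altered_a))
    favourable = 0
    for k in range(lowest, both + 1):
        # Hypergeometric numerator: k co-altered samples among the B-altered ones.
        favourable += comb(altered_a, k) * comb(total - altered_a, altered_b - k)
    return favourable / comb(total, altered_b)


if __name__ == "__main__":
    # Hypothetical counts: 2 co-altered, 150 gene A only, 40 gene B only, 308 unaltered.
    p_value = mutual_exclusivity_p(both=2, only_a=150, only_b=40, neither=308)
    print(f"one-sided P = {p_value:.3g}")
```

With many gene pairs tested at once, the resulting P-values would also need correction for multiple testing before being interpreted.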
Although the previously described databases focus on mRNA expression levels in cancer, the vast majority of anti-cancer drugs are designed against protein targets and changes in protein activity are the major driver of cancer progression. The NCI's Clinical Proteomic Tumor Analysis Consortium (CPTAC) (https://proteomics.cancer.gov/programs/cptac) utilizes large-scale mass spectrometry-based methods to characterize the proteome of patient samples. This initiative was built on transcriptomic data from TCGA to characterize colorectal, breast and ovarian cancer samples (89-91). These initial studies demonstrated that mRNA levels do not correlate accurately with protein levels and that proteomic analysis could be used to further refine patient stratification and identify novel therapeutic targets. The CPTAC dataset has since expanded into several other indications and now comprises proteomic characterization of over 2000 patient samples (92-97). In addition to measuring protein levels, proteomic datasets also include information on post-translational modifications (PTMs). These data have been used to identify downstream targets of KRAS in pancreatic cancer (95) and to suggest that inhibition of Rb phosphorylation could be a viable therapeutic strategy in colorectal cancer (98). Recent re-analysis of CPTAC datasets using more powerful cloud computing methods has identified PTMs that were present at lower levels than could be previously identified and from a wider range than were previously considered (99), suggesting that the full potential of the CPTAC datasets is yet to be realized.
For the purposes of cancer target evaluation, the University of ALabama at Birmingham CANcer data analysis portal (UALCAN) (http://ualcan.path.uab.edu/analysis-prot.html) (100), a user-friendly web portal to analyse CPTAC datasets, houses proteomic data obtained from analysing 2002 patient samples from 17 separate studies (101). The portal allows determination of target protein levels and phosphorylation state across a range of cancer types in comparison to adjacent normal tissue. It can also be used as a user-friendly method of accessing TCGA gene expression and clinical survival data.
The major limitation of current proteomic databases for target evaluation is the limited sample size compared to that available for transcriptomics. Mining datasets comprising hundreds, or low thousands, of patient samples spread across several indications reduces the statistical power to detect significant alterations in protein level and/or PTM status within a single target indication.
An alternative initiative to map the abundance of human proteins across human tissues using antibody-based imaging and mass spectrometry approaches is the Human Protein Atlas (https://www.proteinatlas.org/). The Human Protein Atlas began by characterizing the expression and localization of 700 proteins across 48 normal human tissues and 20 cancer types using tissue microarrays (102) and has progressed to providing expression and localization data for over 16 000 proteins in all major tissues and organs (103). An extension of this initiative is the Human Pathology Atlas (104), which houses mRNA (from TCGA) and protein (from antibody-based imaging) levels for several cancer types. This dataset can be used as independent confirmation of data from CPTAC and knowledge of target protein levels in normal tissue may flag key tissue-specific or general housekeeping functions that could inform tolerability assessments. A future expansion of the Human Protein Atlas that would be of interest would be to build on the PTM analysis of CPTAC and use validated PTM-specific antibodies for cancer-associated proteins to identify and characterize PTM events that could be targeted in cancer.
cBioPortal, TCGA and the Human Protein Atlas datasets allow correlation of omics data with available patient data, providing evidence that target expression and/or mutation are associated with clinical outcome. A simple, user-friendly portal that houses manually curated datasets from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/info/overview.html), the European Genome-Phenome Archive (https://ega-archive.org/) and TCGA is also available: the KMplot web tool (https://kmplot.com/analysis/) allows users to upload their own datasets or mine available miRNA and mRNA expression data and correlate with patient outcome (105). The mRNA expression data from RNA-seq analysis, for example, cover >7000 patient samples across all indications correlated with overall survival. KMplot also allows the user to further restrict the analysis by cancer subtype, including sex, grade, mutation burden or enrichment of specific immune cell subtypes. The output is presented as a Kaplan-Meier survival plot for high/low expression of the target of interest together with a hazard ratio and logrank P-value to determine statistical significance. RalA is a GTPase that functions downstream of KRAS. KMplot has data from 259 patients classified as having low RALA expression and 111 with high RALA expression. Unrestricted analysis demonstrates that there is a statistically significant correlation between high RALA expression and reduced overall survival in hepatocellular carcinoma (Figure 3C), consistent with published data (106). Data demonstrating that high target expression correlates with poor outcome in a specific disease setting can inform clinical positioning assessments.
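The Kaplan-Meier estimate and logrank P-value reported in such survival outputs can be sketched in a few lines of standard-library Python. This is a textbook illustration under stated assumptions, not KMplot's implementation: the function names and the follow-up times are hypothetical, events are coded 1 (death observed) or 0 (censored), and the hazard ratio is not computed here.

```python
import math
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time with at least one event."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def logrank_p(times_a: Sequence[float], events_a: Sequence[int],
              times_b: Sequence[float], events_b: Sequence[int]) -> float:
    """Two-group logrank test; P-value from a chi-squared distribution with 1 df."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    event_times = sorted({t for t, e in pooled if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    if variance == 0:
        return 1.0
    chi_squared = observed_minus_expected ** 2 / variance
    # Survival function of a chi-squared variable with one degree of freedom.
    return math.erfc(math.sqrt(chi_squared / 2))


if __name__ == "__main__":
    # Hypothetical follow-up times (months) for high- and low-expression groups.
    high_t, high_e = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
    low_t, low_e = [10, 11, 12, 13, 14], [1, 0, 1, 1, 0]
    print("KM (high):", kaplan_meier(high_t, high_e))
    print("logrank P:", logrank_p(high_t, high_e, low_t, low_e))
```

In practice, patients would first be split into high and low expression groups at a chosen cut-off, and the choice of cut-off itself affects the resulting P-value.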

Comprehensive target evaluation databases
There are now several open-source comprehensive databases that allow researchers to identify and prioritize potential drug targets for further study (107-109). Such databases aim to integrate information from several of the databases described above into a single easy-to-use platform.
The Open Targets Platform (https://platform.opentargets.org/) was developed as a public-private partnership between several European research institutes and pharmaceutical companies (107). It integrates data from 22 public sources and scores target-disease associations based on information regarding the target, disease, mutation, known drugs, animal models and scientific literature. The Open Targets Platform can be queried by target, drug, specific disease or phenotype (110, 111). Searching by target generates a graphical visualization illustrating disease associations grouped by therapeutic area, together with a separate profile page of available information for the target. This profile includes details on known drugs (investigational or approved) for the target, reported safety effects, chemical probes, target expression (RNA and protein), molecular information and pathways, and phenotypes linked to mouse homologues. Also included in the profile page is a target tractability assessment for small molecule, antibody, PROTAC and other therapeutic modalities. Searching on a specific disease presents a list of targets associated with that disease that can be prioritized for further appraisal. Querying the platform for a specific drug shows its mechanism of action, investigational and approved indications, clinical trials, pharmacovigilance and all scientific literature associated with that drug.
TargetDB (https://github.com/sdecesco/targetDB) is a resource that allows users to rapidly query multiple public databases and generate an integrated summary of available information for a target (108). It is distributed as a Python package with SQLite database and can be downloaded by following the installation instructions on GitHub. TargetDB uses data obtained from public databases and sources such as ChEMBL, Open Targets, Protein Data Bank (PDB), PubMed, the Human Protein Atlas and UniProt (108). When queried for a single gene target, it generates an Excel file containing numerous worksheets with different target information: general information plus a spider plot summary of available data; PubMed search results listing the 500 most recent publications; disease information showing protein and GWAS associations; disease areas and association scores; protein expression; mouse genotypes and associated phenotypes; isoforms; observed variants and mutants; and a list of available crystal structures for the target. Also included is an analysis of potential small molecule binding pockets present in the target together with a ligandability score. Commercially available bioactive compounds are also listed with SMILES strings, target affinities and links to vendor websites. TargetDB allows queries for multiple targets using the list mode function to generate a report containing information for each target, or spider plot mode that outputs target information as a graphical representation.
The canSAR knowledgebase (https://cansar.ai/) is the largest publicly available online resource for translational cancer research and drug discovery (109, 112). It was first released in 2011 (113) by the Institute of Cancer Research and has recently received major updates (109, 112). This new edition integrates multi-omics patient and cell line data (ICGC and TCGA) with genetic mutation and dependency data (DepMap). The annotated human proteome (UniProt Swiss-Prot) together with data on protein 3D structures (PDB and AlphaFold), PPIs (IMEx Consortium and others), medicinal chemistry (ChEMBL and BindingDB) and clinical trials (ClinicalTrials.gov) is used by machine learning algorithms to assess ligandability of the target. Searching the canSAR knowledgebase by target takes the user to a dashboard that includes molecular synopsis pages for ligandability, signalling, disease types, experimental tools, features and chemical tools associated with the target. The 'Disease types' page shows a word cloud of cancer indications that are sized according to a cancer-target association score. There are clinical, mutation, copy number, expression and combined molecular scores available for each target with the score corresponding to how strong the link is between that particular target feature and the specified cancer target indication. This is a particularly powerful feature that allows previously unknown associations between the target and a disease setting to be identified. The 'Ligandability' tab provides a top line summary of areas of the protein associated with druggable cavities, a link to associated structural data and links to any approved or investigational drugs and chemical tools that can be used. canSAR labels associated chemical tools as 'recommended' or 'acceptable', depending on selectivity profiles, and provides direct links to the Chemical Probes Portal and Probe Miner for additional information.
The 'Experimental Tools' synopsis page includes information on expression systems for expressing active target protein, known target engagement biomarkers and cell lines ranked on target expression, genetic dependencies and chemical dependencies identified with tool compounds known to act on the target. Additional features available for each target include a target gene family cladogram, target interactome and association of target mutation or expression with cancer indication-specific immune subtypes as defined by Thorsson et al. (114).
The comprehensive databases described above offer platforms that can quickly provide an overview of a wide range of available datasets for a given target in cancer and should be considered valuable tools for cancer target evaluation.
However, no single platform currently covers all the datasets and tools highlighted in this review and therefore outputs from the various databases described should be considered together as part of a comprehensive target evaluation. The individual databases can also offer additional features or flexibility in data analysis that may provide further information.

Artificial intelligence in drug discovery
Although not the focus of this review, it is impossible to ignore the growing interest in artificial intelligence (AI) and the potential utility across all stages of the drug discovery process. The area has been covered by several recent comprehensive reviews (115-117) that discuss the potential of AI to increase the efficiency of the drug discovery process in target identification, protein structure and druggable site predictions, de novo small molecule design and predictions of drug toxicity. We will therefore only briefly cover applications of AI that we see having a clear impact in the immediate future.
AI offers the opportunity to rapidly integrate the extensive datasets from multiple resources, such as those described in this review, to either identify novel targets or evaluate proposed targets in an unbiased manner (116). In the coming years, omics datasets are only going to get larger and more complex. AI offers the potential to integrate and interpret complex outputs from multiple sources to produce unbiased outputs. Indeed, there are several drug discovery companies that currently utilize AI platforms for target identification that have assets in clinical development (118). As described in previous sections, toxicity is a key reason for attrition of anti-cancer agents during the clinical development process. Current methods to predict toxicity, including evaluating mouse knockout phenotypes or comparing expression or target dependency in cancer versus cells of a non-malignant origin, have limited utility and make it challenging to effectively predict tolerability as part of a target evaluation. This is a key area of importance for the field as more accurate toxicity predictions may better predict the likelihood of clinical success. A potential application of AI will be to make more accurate predictions of toxicity based on predictions of compound PK/PD properties and off-target effects together with liabilities associated with specific chemical characteristics. A recent comprehensive review highlights the current status of toxicity prediction models and challenges associated with development in this area (119). A gold-standard accessible tool in this space will be of immense importance to support the design of compounds with minimal toxicity risks.
The final AI application we wish to highlight is in protein structure predictions. Structural information for a target can be used to identify small molecule binding pockets and to guide compound design in a rational manner. Available structural information is therefore a key factor in tractability assessments. Currently, the Worldwide Protein Data Bank (http://www.wwpdb.org/) is the primary source for such information as the main repository for protein structural data (120). However, coverage of the human proteome is not complete and there may not be available information for a novel target. The AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/), which has been developed by DeepMind from Google, uses AI to predict protein structural data from primary protein sequence, with almost complete coverage of the human proteome. AlphaFold predictions have been evaluated and found to correlate reliably with experimental data (121-125). Structural enablement of a drug discovery project allows predictions to be made regarding the feasibility of identifying a ligand that can bind with sufficient affinity to modulate the activity of the target of interest. It is important to recognize, however, that predictions made by AlphaFold will not currently account for potential conformational changes induced by PPIs, post-translational modification or small molecule binding, for example. An example of the application of AlphaFold comes from the development of a novel small molecule inhibitor of CDK20, which utilized AI technology to design small molecules based on the structures predicted by AlphaFold (126).
Amid the excitement and significant investment in AI approaches, a note of caution is that there are currently no approved drugs against targets that have been identified primarily through AI or approved drugs that have been AI designed. This is likely to change in the near future, given the recent advances in AI technology and the time taken to move through the drug discovery process. In the coming years, comparisons between costs and timelines of AI-driven approaches to conventional drug discovery will allow determination of where AI can make the biggest impact in expediting the drug discovery process.

COMBINING DATABASE OUTPUTS FOR PRACTICAL PURPOSES
To demonstrate how outputs from a defined set of databases described within this review can be utilized in a practical setting, we use an example to illustrate how combining outputs can support evaluation of a novel target and also demonstrate how we can retrospectively analyse clinical trial failures to determine whether this could have been predicted via in silico analysis.
To illustrate how database outputs can support target evaluation and guide further target validation and clinical positioning efforts, we use NME6 as an example of a novel anti-cancer target. NME6 is a mitochondrial nucleoside diphosphate kinase whose primary reported function is to ensure sufficient supply of ribonucleotides to the mitochondria. A PubMed search of 'NME6 AND Cancer' gives only three results, none of which provide sufficient data to identify NME6 as a novel anti-cancer drug target. A recent bioRxiv preprint proposes that NME6 regulates mitochondrial gene expression and should be considered a novel target in diseases in which mitochondrial gene transcription is altered, including cancer ( 127 ).
A minimal set of databases can be used to support the unbiased evaluation of NME6 as a novel anti-cancer drug target (Figure 4A). DepMap output from CRISPR screens identified liver cancer as being the most enriched indication for cancer cell intrinsic NME6 dependency (Figure 4B). The UALCAN portal, used to visualize TCGA patient gene expression data, demonstrates that NME6 gene expression is increased in cancer tissue relative to adjacent normal samples in several indications, with one of the most striking upregulations being observed in liver hepatocellular carcinoma (Figure 4C; LIHC is boxed for clarity). KMplot was then used to determine whether higher NME6 expression was associated with patient outcome across 20 separate indications. There were four indications where high NME6 expression showed a significant correlation with poor outcome, with the strongest association being observed in liver hepatocellular carcinoma (Figure 4D). The IMPC mouse knockout database was used to flag adverse phenotypes associated with NME6 knockout, which may be indicative of toxicity liabilities. Homozygous NME6 knockout results in embryonic lethality, pointing to an essential role in embryonic development, while heterozygous knockout results in minimal observed phenotypes in adult mice (Figure 4E). There are no experimentally determined NME6 protein structures available within the PDB, but predicted structures are available via AlphaFold (Figure 4F). Together, this set of data provides support from cancer cell line and clinical datasets that NME6 function may be linked to liver cancer progression. Available mouse knockout data do not flag any clear toxicity liabilities associated with heterozygous knockout in the adult mouse, and AlphaFold structure prediction gives a starting point for computational chemistry methods to identify putative drug-binding pockets and perform detailed ligandability assessments.
Several additional databases described within this review can provide further mechanistic insight into target biology, guide the selection of clinically relevant pre-clinical model systems and guide potential NME6 targeting strategies. This minimal set of data from freely available databases addresses several of the key outputs of a target evaluation study (Figure 1), which mitigates the risk associated with committing resource to further exploration of NME6 as a novel anti-cancer drug target.
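One way to keep such an evaluation auditable is to record each database finding against the target evaluation aim it informs and then check which aims remain unaddressed. The following is a minimal sketch under stated assumptions: the `Finding` class, the helper functions and the assignment of each database to a single aim are illustrative choices made here rather than a scheme defined in this review, and the summaries simply restate the NME6 observations described above.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four aims of target evaluation summarized in Figure 1.
AIMS = ("efficacy", "clinical positioning", "tolerability", "tractability")


@dataclass(frozen=True)
class Finding:
    database: str
    aim: str      # assumed, illustrative mapping of a database to one aim
    summary: str


NME6_FINDINGS = [
    Finding("DepMap", "efficacy",
            "Liver cancer is the most enriched indication for NME6 dependency"),
    Finding("UALCAN", "clinical positioning",
            "Expression increased in tumour vs normal tissue, notably in LIHC"),
    Finding("KMplot", "clinical positioning",
            "High expression linked to poor outcome in 4 of 20 indications"),
    Finding("IMPC", "tolerability",
            "Homozygous knockout embryonic lethal; heterozygous minimal phenotype"),
    Finding("PDB/AlphaFold", "tractability",
            "No experimental structure; predicted structure available"),
]


def group_by_aim(findings):
    """Group findings under each aim, keeping aims with no evidence as empty lists."""
    grouped = defaultdict(list)
    for finding in findings:
        if finding.aim not in AIMS:
            raise ValueError(f"Unknown aim: {finding.aim}")
        grouped[finding.aim].append(finding)
    return {aim: grouped.get(aim, []) for aim in AIMS}


def uncovered_aims(findings):
    """List the aims for which no database evidence has been recorded."""
    return [aim for aim, items in group_by_aim(findings).items() if not items]


if __name__ == "__main__":
    for aim, items in group_by_aim(NME6_FINDINGS).items():
        print(aim.upper())
        for item in items:
            print(f"  {item.database}: {item.summary}")
    print("Aims without evidence:", uncovered_aims(NME6_FINDINGS))
```

A structure of this kind makes gaps explicit: if only the DepMap finding had been recorded, `uncovered_aims` would list clinical positioning, tolerability and tractability as still requiring data.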
A key objective of the use of databases for cancer target evaluation is to reduce the high attrition rate associated with clinical development of novel compounds. As highlighted in previous sections, accurate toxicity predictions are challenging with currently available tools. However, output from the databases described within this review can inform clinical development. To illustrate how integration of database outputs could have been used to predict differing clinical outcomes between alternative indications, we evaluate epidermal growth factor receptor (EGFR) targeting in glioblastoma (GBM). cBioPortal analysis shows that EGFR gene alterations are present in approximately half of GBM patient samples, with gene amplification and mutation being the most prevalent alterations observed (Figure 5A). The high rates of EGFR alterations gave optimism that EGFR could be a therapeutic vulnerability of GBM, and this was supported by in vitro data from GBM cell lines ( 128 ). However, while EGFR inhibitors have approval for treatment of EGFR mutant NSCLC, they have failed to demonstrate efficacy in GBM ( 129 , 130 ). Multiple reasons may account for failure in the GBM setting. The anatomical properties of the blood-brain barrier can limit drug penetrance within GBM, and compensatory mutations of other receptor tyrosine kinases may drive resistance. cBioPortal analysis of mutation hotspots shows that EGFR mutations in GBM are most frequently found within the extracellular furin-like domain, while EGFR mutations in lung cancer typically occur within the intracellular kinase domain (Figure 5B). These data suggest that first- and second-generation EGFR inhibitors that target the kinase domain are not targeted towards the EGFR mutations observed in GBM and explain, to some extent, the lack of efficacy.
The efficacy of EGFR inhibition is likely to be driven, at least in part, by the reshaping of the tumour immune microenvironment, with higher infiltration and proliferation of anti-tumour T cells observed in mouse models following EGFR inhibition ( 131 ). GBM is well recognized as a 'cold' tumour type, with low infiltration of activated T cells, where the striking efficacy of immunotherapies observed in some other indications has not yet been achieved ( 132 ). Mining the TIMER database for associations between gene alteration and immune cell infiltrate illustrates that EGFR mutation or amplification in GBM patient samples does not correlate with any change in T-cell infiltration (Figure 5C), in contrast to lung adenocarcinoma, where EGFR alteration is associated with reduced T-cell infiltration (Figure 5D). Together, these database analyses indicate that, despite the high levels of EGFR genetic alteration observed in GBM, the lack of kinase-activating mutations and the lack of impact of EGFR alteration on the tumour immune microenvironment mean that EGFR kinase inhibitors are likely to have differing efficacy in EGFR mutant NSCLC and GBM. Such analysis may point to a need to generate compounds that target the EGFR extracellular mutation to target EGFR mutant GBM ( 133 ) or to combine with additional agents that promote increased T-cell infiltration.

PERSPECTIVES
The breadth of free-to-use publicly available cancer databases described within this review (Table 1) allows a comprehensive evaluation of a novel target to be carried out in a short timeframe and without the need for specialist training. The key aims of target evaluation (Figure 1) are to determine whether the risks associated with a specific target programme are sufficiently mitigated by available information to support progression into the drug discovery process. Utilization of the databases described within this review can increase the information available and support assessments of existing data to inform target evaluation and consequently reduce the high attrition rate associated with novel drug development. Effective mining of databases can also be used to identify previously unknown clinical opportunities for a selected target or repurposing opportunities for an existing drug. Incorporation of patient omics data and effective utilization of clinically relevant pre-clinical models provide key data to support the development of precision medicine on a patient-specific basis.
It is important, however, that database output is understood in the context of the specific limitations that we discuss throughout. For example, databases describing the impact of target depletion in cancer cell lines or in immunocompromised mouse models will not be of use when predicting a target's role in the tumour immune response. Similarly, databases that house compound screening datasets will only be informative when the selectivity profile of the compounds used is known. There are also knowledge gaps that could be filled by the creation of new databases. Databases that would be of key importance may include one housing datasets from cancer cell line screening under physiologically relevant cell culture conditions, a portal that allows comparison between target knockout phenotypes in cancer and non-cancer (including stromal and immune) cells, and a database that compiles in vivo target depletion/deletion data from tumour models.

DATA AVAILABILITY
No new data were generated or analysed in support of this research. All databases used to generate data have been cited and a URL provided.